Method, system and apparatus for improved voice recognition

ABSTRACT

An improved voice recognition system in which a Voice Keyword Table is generated and downloaded from a set-up device to a voice recognition device. The VKT includes visual form data, spoken form data, phonetic format data, and an entry corresponding to a keyword, and TTS-generated voice prompts and voice models corresponding to the phonetic format data. A voice recognition system on the voice recognition device is updated by the set-up device. Furthermore, voice models in the voice recognition device are modified by the set-up device.

BACKGROUND OF THE INVENTION

The present invention relates to voice recognition, and more specifically to improving the performance of a voice recognition apparatus.

A Voice Keypad (VKP) is a device with the ability to recognize keywords uttered by a user and generate corresponding outputs, for example, commands or text-strings, for use by an application device.

One implementation of a VKP is a Bluetooth speakerphone for use with a mobile telephone provided with Bluetooth functionality. In such a device, the VKP speakerphone and mobile telephone are paired. A voice recognition engine on the VKP is implemented to recognize a name uttered by a user with reference to a user-defined name list and output a corresponding telephone number. A dialing function on the mobile telephone then dials the number, and the user is able to carry on a conversation through the mobile telephone via the speakerphone.

There are three general classes of voice recognition, namely speaker independent (SI), speaker dependent (SD) and speaker adapted (SA). In the SI system, a voice recognition engine identifies utterances according to universal voice models generated from samples obtained from a large training population. As no individual training by the user is required, such systems are convenient. However, these systems generally have low recognition performance, especially when used by speakers with heavy accents or whose speech patterns otherwise diverge from the training population. On the other hand, SD systems require users to provide samples for every keyword, which can become burdensome and memory intensive for large lists of keywords.

Conventional SA systems achieve limited improvement of recognition performance by adapting voice models according to speech input by an individual speaker. However, it is desirable to achieve a still higher recognition rate for keywords on a VKP. Furthermore, the VKP itself may lack the appropriate resources to achieve improved voice recognition.

SUMMARY

Provided are a method, system, and apparatus for improved voice recognition.

In an embodiment of the present invention, a method for improved voice recognition in a system having a set-up device and a voice recognition device is provided. The method comprises the steps of generating a Voice Keyword Table (VKT) and downloading the VKT to the voice recognition device; upgrading a voice recognition system on the voice recognition device; and modifying a voice model in the voice recognition device.

The VKT preferably comprises visual form data, spoken form data, phonetic format data, and an entry corresponding to a keyword, and TTS-generated voice prompts and voice models corresponding to the phonetic format data. The step of generating a VKT preferably comprises the steps of inputting visual form data and entry data; transforming visual form data to default spoken form data; mapping spoken form data to phonetic format; and performing TTS-guided-pronunciation editing to modify phonetic format data. In preferred embodiments, an additional step of a confusion test using the phonetic format data, voice models and a confusion table to identify keywords in a confusion set is performed. Furthermore, additional steps may be taken to eliminate keywords from the confusion set.

In preferred embodiments, a user-initiated step of modifying a voice model in the voice recognition device comprises the steps of building a keyword model from keywords in the VKT; selecting keywords for adaptation; obtaining new speech input for selected keywords; adapting voice models for selected keywords using existing keyword voice models and new speech input to produce adapted voice models; and downloading adapted speech models to the voice recognition device.

Alternatively or in addition thereto, a new-model-availability-initiated step of modifying a voice model in the voice recognition device comprises the steps of downloading a new voice model from a network to the set-up device; if the new voice model is a newer version than the voice model on the voice recognition device, determining if accumulated personal acoustic data exists; if accumulated personal acoustic data exists, uploading the VKT from the voice recognition device to the set-up device, building a keyword model for adaptation from keywords in the uploaded VKT, performing adaptation using the new voice model and the accumulated personal acoustic data to produce an adapted new voice model, and downloading the adapted new voice model to the voice recognition device; and if no accumulated personal acoustic data exists, uploading the VKT to the set-up device, building a keyword model for keywords in the uploaded VKT using the new voice model, and downloading the updated new voice model to the voice recognition device. The accumulated personal acoustic data may be, for example, speech input recorded during user-initiated adaptation of voice models and stored on the set-up device or speech input recorded during use of the voice recognition device to identify keywords and stored on the voice recognition device.

In preferred embodiments, the step of upgrading and downloading a voice recognition system to the voice recognition device comprises the steps of downloading an updated voice recognition system to the set-up device via a network; determining if the updated voice recognition system is more recent than a voice recognition system on the voice recognition device; and if the updated voice recognition system is more recent, downloading the updated voice recognition system from the set-up device to the voice recognition device.

In preferred embodiments, run-time information is saved in the voice recognition device; saved run-time information is up-loaded from the voice recognition device to the set-up device; the up-loaded run-time information is processed on the set-up device; and the voice recognition device is updated according to the results of the processing of run-time information on the set-up device to improve voice recognition performance.

In addition, the method preferably includes one or more of the steps of initiating a diagnostic test on the voice recognition device by the set-up device, providing customer support over a network, and providing wireless capable device compatibility support comprising instructions for pairing the voice recognition device with a wireless capable application device.

In an embodiment of the present invention, a voice recognition system installed on a set-up device for improving voice recognition on a voice recognition device is provided. The voice recognition system comprises a Voice Keyword Table (VKT) generating means for generating a VKT and downloading the VKT to the voice recognition device; and means for updating voice models on the voice recognition device. The VKT preferably comprises visual form data, spoken form data, phonetic format data, and an entry corresponding to a keyword, and TTS-generated voice prompts and voice models corresponding to the phonetic format data.

In preferred embodiments, the voice recognition system further comprises means for performing a confusion test using the phonetic format data, voice models and a confusion table to identify keywords in a confusion set, and eliminating keywords from the confusion set. In addition, the voice recognition system further preferably comprises means for updating the voice recognition device according to the results of the processing of run-time information saved on the voice recognition device to improve voice recognition performance.

In preferred embodiments, the voice recognition system further comprises means for user-initiated and/or new-model-availability-initiated adaptation of voice models on the voice recognition device. The means for new-model-availability-initiated adaptation preferably uses accumulated personal acoustic data recorded during user-initiated adaptation of voice models on the voice recognition device or recorded during operation of the voice recognition device to identify keywords.

In preferred embodiments, the voice recognition system further comprises one or more means for upgrading and downloading a voice recognition system to the voice recognition device, means for initiating a diagnostic test on the voice recognition device, means for providing customer support via a network, and means for providing wireless capable device compatibility support comprising instructions for pairing the voice recognition device with a wireless capable application device.

In an embodiment of the present invention, an apparatus for improved voice recognition is provided. The apparatus comprises a set-up device comprising a first Voice Keyword Table (VKT) and a first voice recognition system; and a voice recognition device comprising a second VKT corresponding to the first VKT and a second voice recognition system, the voice recognition device connectible to the set-up device through an interface. The VKT comprises visual form data, spoken form data, phonetic format data, and an entry corresponding to a keyword, and TTS-generated voice prompts and voice models corresponding to the phonetic format data. The voice recognition device is preferably a Voice Keypad (VKP) device or a wireless earset. The set-up device is preferably a personal computer (PC).

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a voice recognition (VR) apparatus according to an embodiment of the present invention;

FIG. 2A is a block diagram of a set-up device according to an embodiment of the present invention;

FIG. 2B is a block diagram of a Voice Keyword Table (VKT) on the set-up device according to an embodiment of the present invention;

FIG. 3A is a block diagram of a VR device according to an embodiment of the present invention;

FIG. 3B is a block diagram of a corresponding VKT on the VR device according to an embodiment of the present invention;

FIG. 4 is a block diagram of an application device according to an embodiment of the present invention;

FIG. 5 is a flow diagram of a method of improved voice recognition according to an embodiment of the present invention;

FIG. 6A is a flow diagram of a method of generating a VKT according to an embodiment of the present invention;

FIG. 6B is a flow diagram of a method of performing TTS-guided-pronunciation editing according to an embodiment of the present invention;

FIG. 7A is a flow diagram of a method of upgrading the set-up device VR system according to an embodiment of the present invention;

FIG. 7B is a flow diagram of downloading an updated version of the VR device system to the set-up device according to an embodiment of the present invention;

FIG. 7C is a flow diagram of a method of updating the VR system on the VR device according to an embodiment of the present invention;

FIG. 8 is a flow diagram of a method of user-initiated voice model adaptation according to an embodiment of the present invention;

FIG. 9A is a flow diagram of a method of downloading new voice models to the set-up device according to an embodiment of the present invention;

FIG. 9B is a flow diagram of a method of new-model-availability-initiated voice model adaptation according to an embodiment of the present invention; and

FIG. 10 is a flow diagram of a method of performing a diagnostic routine on the VR device according to an embodiment of the present invention.

DESCRIPTION

FIG. 1 is a block diagram of a voice recognition (VR) apparatus according to an embodiment of the present invention.

In a preferred embodiment of the invention, the VR apparatus comprises a set-up device 100, a voice recognition (VR) device 200, and an application device 300. The set-up device may be, for example, a personal computer or personal digital assistant (PDA).

The VR device 200 may be, for example, a headset, a speakerphone, an earset or an earset/speakerphone combo with VR functionality. In preferred embodiments, VR device 200 is a Voice Keypad (VKP), namely a device with the ability to recognize keywords uttered by a user and generate corresponding outputs, for example, commands or text-strings, for use by an application device.

The application device 300 is a device that performs a function under the control of the VR device 200. The application device 300 may be, for example, a mobile telephone, a PDA, a global positioning device, a home appliance or information appliance, a personal computer, a control system for a DVD/MP3 player, a car radio, or a car function controller.

Set-up device 100, VR device 200 and application device 300 are connected by wired or wireless connections. In a preferred embodiment, set-up device 100 is connected to VR device 200 by a USB interface, while VR device 200 is connected to application device 300 by a wireless interface, for example, Bluetooth.

In the embodiment described below, the set-up device 100 is a personal computer, the VR device 200 is a wireless earset, and the application device 300 is a mobile telephone. However, it is understood that this embodiment is exemplary in nature and in no way intended to limit the scope of the invention to this particular configuration.

In this embodiment, VR device 200 may be used as a VKP for dialing numbers and entering commands on the application device 300. In conjunction therewith, VR device 200 provides conventional wireless earset functionality, namely, audio input/output for conversation and other communication via the mobile telephone. It is understood that when connected to set-up device 100, VR device 200 may also serve as an audio input/output device for the set-up device.

In the case where application device 300 is simply a control system, for example, a control system for a DVD/MP3 player, VR device 200 may be used to transmit commands thereto, and no audio input/output functionality via the application device 300 need be provided.

FIG. 2A is a block diagram of a set-up device 100 according to an embodiment of the present invention.

In a preferred embodiment of the present invention, set-up device 100 is a personal computer comprising controller 101, voice recognition system (VRS) 102, display 120, input 130, storage 180, and interface 190.

The controller 101 may be, for example, a microprocessor and related hardware and software for operating the set-up device 100. Display 120 may be, for example, a monitor such as an LCD monitor. Input device 130 may be a keyboard/mouse or other conventional input device or devices. Storage 180 is a memory or memories, for example, a hard drive or flash memory, and is used for storing new voice models and personal accumulated acoustic data, as will be described in further detail below. An interface 190 for connecting to VR device 200 is also provided, for example, a USB interface, a wireless interface such as Bluetooth, or an 802.11 wireless network interface. Furthermore, set-up device 100 is connected to a network, for example, a global network such as the World Wide Web.

In a preferred embodiment, VRS 102 comprises a Voice Keyword Table (VKT) 110 and a number of modules implemented in software and/or hardware on set-up device 100. As will be described in further detail in connection with FIGS. 5-10, the modules preferably include Voice Keyword Table (VKT) generation module 150 including a TTS-guided-pronunciation editing module 151 and a confusion test module 152, system upgrade module 155, voice model update module 160 including an adaptation module 161, diagnostics module 165, customer support module 170, and wireless capable device compatibility module 175. In a preferred embodiment, the VKT and, to the extent that they are software, the modules, are stored in a memory or memories of set-up device 100.

FIG. 2B is a block diagram of VKT 110 according to an embodiment of the present invention.

In a preferred embodiment of the invention, VKT 110 comprises table 111, voice model database 112, and TTS-generated voice prompt database 113. Table 111 stores pre-defined keywords, such as HOME and SET-UP MENU, and user-defined keywords such as BRIAN, RYAN and JOSE, and entry data corresponding to the keywords. Entry data may be text-strings, such as telephone numbers, or commands, such as a command for entering a set-up menu.

As will be described in further detail below, in preferred embodiments, table 111 stores visual form data corresponding to any visual symbol the user uses to represent a keyword in the VKT 110, and spoken form data corresponding to an utterance associated with the keyword. In addition, table 111 comprises phonetic format data corresponding to the spoken form data.

It is understood that depending on the application device used in conjunction with the VR device, keywords of different categorizations may be employed. Namely, pre-defined and user-defined keywords may include command functions related to the features of any particular application device. For example, if the application device is an MP3 player, the keywords may include pre-defined MP3 player commands such as STOP or RANDOM, user-defined commands, and others. The commands may also be associated with operation of the VR device itself. For example, the command SET-UP MENU may activate a voice prompt interface on the VR device.

Furthermore, the entry data is not limited to text-strings and commands. For example, entry data may include images, wave files, and other file formats. It is further contemplated that more than one entry field be associated with a given keyword. It is also contemplated that the VKT may store speaker dependent voice tags and corresponding speaker dependent voice models and entry data.

Voice model database 112 stores the current set of voice models for the system. In embodiments of the invention, a voice model generating module of VRS 102 generates voice models corresponding to the phonetic format data for keywords in VKT 110 to populate voice model database 112. As will be explained in further detail below, the voice models may comprise universal speaker-independent (SI) voice models and/or speaker-adapted (SA) voice models adapted according to embodiments of the present invention.

TTS-generated voice prompt database 113 stores data for the generation of text-to-speech (TTS) voice prompts used in embodiments of the present invention. In embodiments of the invention, a TTS-module of VRS 102 generates speech wave files corresponding to the phonetic format data for keywords in VKT 110 to populate voice prompt database 113.

Additional features of VKT 110 are described in following sections in connection with FIGS. 5-10.
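By way of illustration only, the organization of VKT 110 described above may be sketched in Python as follows. The class and field names, and the telephone numbers, are hypothetical and are not taken from any actual implementation; the sketch merely groups table 111, voice model database 112 and TTS-generated voice prompt database 113 into one structure.

```python
from dataclasses import dataclass, field


@dataclass
class KeywordEntry:
    """One row of table 111 (field names are illustrative assumptions)."""
    visual_form: str            # symbol the user sees, e.g. "JOSE"
    spoken_form: str            # utterance associated with the keyword
    phonetic_format: str = ""   # filled in by word-to-phoneme mapping
    entry: str = ""             # text-string or command sent to the application device


@dataclass
class VoiceKeywordTable:
    """Hypothetical container mirroring table 111 and databases 112 and 113."""
    table: dict = field(default_factory=dict)          # table 111
    voice_models: dict = field(default_factory=dict)   # voice model database 112
    voice_prompts: dict = field(default_factory=dict)  # voice prompt database 113

    def add(self, row: KeywordEntry) -> None:
        # Rows are keyed by visual form in this sketch.
        self.table[row.visual_form] = row


vkt = VoiceKeywordTable()
vkt.add(KeywordEntry("RYAN", "RYAN", entry="555-0100"))   # made-up number
vkt.add(KeywordEntry("JOSE", "HOSAY", entry="555-0101"))  # edited spoken form
```

Keeping the spoken form separate from the visual form is what allows a keyword such as JOSE to be displayed as entered while being pronounced and recognized as HOSAY.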

FIG. 3A is a block diagram of VR device 200 according to an embodiment of the present invention.

In a preferred embodiment of the present invention, VR device 200 comprises controller 201, voice recognition system (VRS) 202 comprising VKT 210 and voice recognition engine (VRE) 220, speaker 230, microphone 240, battery 250, storage 280, and interface 290.

The controller 201 may be, for example, a microprocessor and related hardware and software for operating the VR device 200 and performing digital signal processing on audio input received by microphone 240. Speaker 230 is a conventional speaker for outputting audio. Microphone 240 may be a single microphone or an array microphone, and is preferably a small array microphone (SAM). Storage 280 is a memory or memories, preferably a flash memory, and is used for storing run-time information and/or personal accumulated acoustic data, as will be described in further detail below. Interface 290 is provided for connecting with set-up device 100 and application device 300. For example, a USB interface may be provided for connecting to set-up device 100, while a wireless interface may be provided for connecting to application device 300. In the case where VR device 200 connects to both devices by a wireless connection, the interface may comprise a single wireless interface (for example, Bluetooth) or multiple wireless interfaces (for example, one Bluetooth and one 802.11 wireless network).

VKT 210 corresponds to VKT 110, and, as shown in FIG. 3B, comprises corresponding table 211, voice model database 212, and TTS-generated voice prompt database 213.

In preferred embodiments, VRE 220 receives signals generated by microphone 240 and processed by controller 201, and extracts feature data for comparison with voice models stored in voice model database 212 so as to determine if the utterance matches a keyword in VKT 210. As the features and operation of voice recognition engines are well known in the art, further description is not provided here.

It is a feature of embodiments of this invention that VKT 110 is mirrored in VKT 210. Namely, data entered into VKT 110 may be synched to VKT 210, and vice versa, when the corresponding devices are connected.

In embodiments of the present invention, VR device 200 includes functionality to receive data input independent from set-up device 100. For example, VR device 200 may include a voice prompt guided interface for adding data to VKT 210. In this case, newly added data in VKT 210 may be synched to VKT 110 when the corresponding devices are connected.
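The two-way synchronization of VKT 110 and VKT 210 may be sketched as below. The description does not state how conflicting edits are resolved, so the per-keyword revision counter used here is purely an assumption made for illustration, as are the function name and the sample data.

```python
def sync_tables(set_up_table, vr_table):
    """Merge two keyword tables, keeping the higher-revision row per keyword.

    Each table maps keyword -> (revision, entry_data). The revision counter
    is a hypothetical conflict-resolution rule, not part of the description.
    """
    merged = dict(set_up_table)
    for keyword, (revision, entry) in vr_table.items():
        if keyword not in merged or revision > merged[keyword][0]:
            merged[keyword] = (revision, entry)
    return merged


# VKT 110 on the set-up device and VKT 210 on the VR device (made-up data).
vkt_110 = {"RYAN": (1, "555-0100")}
vkt_210 = {"RYAN": (2, "555-0199"), "HOME": (1, "555-0142")}

# After a connection is made, both devices would store the merged table.
merged = sync_tables(vkt_110, vkt_210)
```

In this sketch a keyword added on the VR device through a voice prompt guided interface (HOME) reaches the set-up device, and the more recent of two edits to RYAN is kept on both.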

It is a feature of the preferred embodiment of the present invention that run-time information collected in the operation of VR device 200 is stored in storage 280. When VR device 200 is connected to set-up device 100, the run-time information is uploaded from VR device 200 to the set-up device 100 and processed by VRS 102 for the purpose of improving voice recognition performance. The VR device 200 may then be updated according to the results of the processing of run-time information to improve voice recognition performance. An example of the kind of run-time information that may be stored is acoustic data corresponding to successful keyword recognitions and/or data obtained from application device 300.

FIG. 4 is a block diagram of application device 300 according to an embodiment of the present invention.

In a preferred embodiment of the present invention in which application device 300 is a mobile telephone, application device 300 comprises a controller 301, an RF module 310 with an antenna for connecting to a communications network, a control program 302 comprising a dialing module 320 stored in a memory, a speaker 330 and a microphone 340. An interface 390 is provided for connecting to VR device 200, for example, a wireless interface such as Bluetooth. As the features and structure of a mobile telephone are well known in the art, further description is not provided here.

In general, a user operates VR device 200 to control application device 300. In the embodiment where application device 300 is a mobile telephone, for example, if a user wishes to dial a contact RYAN, he or she utters the keyword RYAN into microphone 240. After front-end digital signal processing, VRS 202 determines a matching keyword, if any. If there is a keyword match, entry data corresponding to the matched keyword is transmitted from VR device 200 to application device 300 via interfaces 290 and 390. If, for example, the entry data corresponding to RYAN is a telephone number, dialing module 320 receives the telephone number and dials the contact RYAN. It is understood that the system may also include other conventional functions such as a voice prompt feedback step allowing the user to confirm or reject a keyword match.

It is another feature of preferred embodiments of the present invention that during normal use of the VR device 200, personal acoustic data is recorded and accumulated in storage 280 for later use in adaptation. For example, if the user utters the keyword RYAN and the user confirms the match determined by VRS 202, the recorded utterance is stored in storage 280 along with data associating the recorded utterance with the keyword RYAN. It is further understood that other methodologies may be employed to determine if VRS 202 successfully matched the keyword.
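The run-time behavior described in the two preceding paragraphs, namely transmitting entry data on a confirmed match and accumulating the confirmed utterance for later adaptation, may be sketched as follows. The function name, the byte strings standing in for recorded audio, and the telephone number are hypothetical.

```python
def handle_match(keyword, confirmed, utterance, table, accumulated):
    """Return the entry data to transmit to the application device, or None.

    keyword     -- keyword matched by the recognizer, or None if no match
    confirmed   -- True if the user confirmed the match via voice prompt feedback
    utterance   -- recorded audio of the utterance (any bytes-like object)
    table       -- mapping of keyword -> entry data
    accumulated -- list standing in for personal acoustic data in storage 280
    """
    if keyword not in table or not confirmed:
        return None
    # Store the utterance together with the keyword it was confirmed for.
    accumulated.append((keyword, utterance))
    return table[keyword]


table = {"RYAN": "555-0100"}  # made-up entry data
accumulated = []
sent = handle_match("RYAN", True, b"\x00\x01", table, accumulated)
rejected = handle_match("RYAN", False, b"\x02\x03", table, accumulated)
```

Only confirmed utterances are accumulated in this sketch, so the data later used for adaptation is labeled with the keyword the user actually intended.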

Furthermore, the user may operate VR device 200 to control the VR device itself. For example, if the user utters SET-UP MENU, controller 201 may cause the VR device to output a voice guided set-up menu via speaker 230.

The operation of the voice recognition apparatus and component parts thereof is described in further detail below.

FIG. 5 shows the basic process flow of a preferred embodiment of VRS 102 for achieving improved voice recognition of the present invention. Steps 400-430 are described in further detail in connection with FIGS. 6-10.

In step 400, VKT 110 is generated on the set-up device 100 and downloaded to the VR device 200, where it is stored in a memory as VKT 210.

In step 410, one or both of VRS 102 and VRS 202 are upgraded.

In step 420, voice models are modified and downloaded from set-up device 100 to VR device 200.

In step 430, a diagnostics routine is performed on VR device 200.

In step 440, remote customer support is provided. In a preferred embodiment, an interface may be provided via display 120 and input 130 allowing a user to link to a knowledgebase or other customer support services. In addition, manual download of updated software and voice models may be performed through this interface.

In step 450, remote wireless capable device compatibility support is provided. In a preferred embodiment, an interface is provided on display 120 for the user to link to a wireless capable device compatibility database over a network using input device 130. In a preferred embodiment, the network comprises a web server. For example, in an embodiment of the present invention in which application device 300 is a mobile telephone with Bluetooth functionality, the database contains specific instructions for pairing VR device 200 with various makes and models of mobile telephones.

It is understood that the present invention is not intended to be limited to the performance of all of steps 400-450, or performance of the steps in the above-described order, although in a most preferred embodiment each of steps 400-450 is performed.

FIG. 6A shows the steps of generating a VKT according to a preferred embodiment of the present invention.

In step 500, keyword data is inputted into visual form and corresponding entry fields of table 111. For example, in a preferred embodiment, data may be extracted from a software application by VKT generation module 150 to populate the visual form and entry data fields of table 111. Manual input or editing of extracted data may also be performed to input data into table 111.

In a preferred embodiment of the present invention, visual form, spoken form, and entry data are displayable on display 120 and may be entered/edited in table 111 with input device 130.

For example, in an embodiment of the present invention where application device 300 is a mobile telephone and set-up device 100 is a personal computer, the user may elect to extract data from an online telephone program account or an email address book located on set-up device 100 or accessed by set-up device 100 via a network to populate the visual form and entry data fields of table 111. In this case, VKT generation module 150 extracts relevant data and populates table 111. The table may then be edited by amending, adding, or deleting keywords and entries (for example, names and telephone numbers) according to the user's preference.

In step 510, visual form data is transformed into spoken form data. Visual form data corresponds to any visual symbol the user uses to represent a keyword in the VKT. On the other hand, spoken form data corresponds to an actual utterance associated with the keyword. In a preferred embodiment, default spoken form data is automatically generated from visual form data by VKT generation module 150. If the keywords are in a language in which the visual form data can also serve as the basis for word-to-phoneme translation and is easily edited by a user to achieve different pronunciations, the visual form data may simply be copied into the spoken form data. For example, if the keyword is RYAN, the visual form data and the default spoken form data are the same. On the other hand, for a language such as Chinese, in which the visual form data cannot serve as the basis for word-to-phoneme translation and is not easily edited to achieve different pronunciations, a word-to-pinyin translation or the like may be employed to generate the default spoken form data in pinyin or other alphabet conversion format. Thus, if the keyword is the Chinese word for “flower” and word-to-pinyin translation were employed, the visual form data would be the Chinese character for flower (花) and the default spoken form data would be the pinyin translation thereof, i.e., “HUA”.

The user may also add or edit spoken form data by manual entry through input device 130. For example, in table 111, the default spoken form data for keywords BRIAN and JOSE is BRIAN and JOSE, but for reasons explained in further detail in the following, the spoken form data has been edited to BRIAN SMITH and HOSAY.
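The generation of default spoken form data in step 510 may be sketched as follows. The one-entry lookup table standing in for a word-to-pinyin translation, and the function names, are assumptions made for illustration; an actual translation module would be far more extensive.

```python
# Hypothetical stand-in for a word-to-pinyin translation resource.
PINYIN_LOOKUP = {"花": "HUA"}


def word_to_pinyin(visual_form):
    """Translate a visual form to pinyin using the illustrative lookup table."""
    return PINYIN_LOOKUP[visual_form]


def default_spoken_form(visual_form, to_alphabetic=None):
    """Return default spoken form data for a keyword (step 510).

    If the visual form can itself serve as the basis for word-to-phoneme
    translation, it is copied unchanged. Otherwise the supplied conversion
    (for example, word-to-pinyin) is applied.
    """
    if to_alphabetic is None:
        return visual_form
    return to_alphabetic(visual_form)


english_default = default_spoken_form("RYAN")
chinese_default = default_spoken_form("花", word_to_pinyin)
```

The default is only a starting point; as noted above, the user remains free to overwrite it, for example changing JOSE to HOSAY.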

In step 515, spoken form data is mapped to phonetic format data by VKT generation module 150 by a word-to-phoneme translation module utilizing a pronunciation dictionary and pronunciation rules.

In step 520, TTS-guided-pronunciation editing is performed by the TTS-guided-pronunciation editing module 151. This step is shown in further detail in FIG. 6B, in which the following steps are performed.

In step 550, the user selects a keyword. Subsequently, in step 560, a TTS-generated voice prompt is generated by VKT generation module 150 according to the phonetic format data currently stored corresponding to the selected keyword and TTS-generated voice prompt database 113. If the user is satisfied with the output, the routine is ended and, at the user's option, another keyword may be selected. The voice prompt is preferably outputted by speaker 230 of VR device 200 if VR device 200 is connected to set-up device 100. Alternately, a speaker or other audio output device of set-up device 100 (not shown) may be used.

If the user is not satisfied with the output, the user may in step 570 edit the spoken form data in table 111. The edited spoken form data is in turn mapped to phonetic format data in step 580, and the routine returns to step 560 to determine if the user is satisfied with the modification, or if further editing of the spoken form data is required to bring the pronunciation generated by the TTS-generated voice prompt closer to the desired pronunciation.

For example, in the case of a keyword JOSE, the default spoken form data is JOSE. However, the phonetic format data mapped from JOSE sounds like JOE-SEE when the voice prompt is generated. If this pronunciation is unsatisfactory to the user, the user may edit the spoken form data to HOSAY, for which the mapped phonetic format data is ho'zei. The voice prompt generated corresponding to this phonetic format data sounds like the Spanish-language pronunciation of the word Jose.
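The editing loop of FIG. 6B (steps 560 to 580) may be sketched as follows. The three callables are placeholders supplied by the caller; in particular, the lower-casing function used in the usage lines is only a stand-in for word-to-phoneme translation and does not produce real phonetic format data.

```python
def tts_guided_edit(spoken_form, to_phonetic, play_prompt, ask_user):
    """Repeat prompt playback and editing until the user is satisfied.

    to_phonetic -- maps spoken form data to phonetic format data (steps 515/580)
    play_prompt -- outputs a TTS voice prompt for the phonetic data (step 560)
    ask_user    -- returns None if the user is satisfied, otherwise the
                   edited spoken form data (step 570)
    """
    while True:
        phonetic = to_phonetic(spoken_form)
        play_prompt(phonetic)
        edited = ask_user(spoken_form)
        if edited is None:
            return spoken_form, phonetic
        spoken_form = edited


# Usage with stand-ins: the "user" rejects JOSE once, then accepts HOSAY.
played = []
responses = iter(["HOSAY", None])
result = tts_guided_edit(
    "JOSE",
    to_phonetic=str.lower,              # placeholder, not a real phoneme mapper
    play_prompt=played.append,          # records what would be spoken
    ask_user=lambda current: next(responses),
)
```

Because the loop returns both the final spoken form data and the phonetic format data derived from it, the same phonetic data can be stored in table 111 and reused when voice models are generated.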

Returning to FIG. 6A, in step 530, in a preferred embodiment of thepresent invention a confusion test is performed on VKT 110 by confusiontest module 152 in which phonetic format data corresponding to keywordsis analyzed such that keywords are recognized as members of a confusionset and distinguished. Namely, phonetic format data from table 111,corresponding voice models from voice model database 112, and aconfusion table are used to generate a confusion matrix to check andpredict the recognition performance for the keywords and provideguidance to the user for improving performance. For example, the spokenform data may be changed to obtain a different pronunciation, a prefixor suffix may be added to the keyword, or adaptation may be performed onthe confusable words.

For example, on determination of a confusion set, the user may elect to edit spoken form data for one or more of the confused terms, thereby returning the routine to step 510. In the case where the keywords are BRIAN and RYAN, phonetic format data mapped from the default spoken form data (BRIAN and RYAN) may be identified as a confusion set based on the voice models present in voice model database 112. Once the confusion set is identified to the user, the user may elect to edit the spoken form data for BRIAN to BRIAN SMITH. New phonetic format data is then mapped from the edited spoken form data in step 515.
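The idea of flagging a confusion set can be illustrated with a simplified stand-in. The sketch below does not use voice models or a confusion table as described above; instead it substitutes a normalized edit distance between phoneme sequences, which is an assumption made purely for illustration. The phoneme symbols, the threshold value, and the function names are likewise hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def confusable_pairs(
    phonetic: Dict[str, List[str]], threshold: float = 0.3
) -> List[Tuple[str, str]]:
    """Return keyword pairs whose normalized distance is below the threshold."""
    pairs = []
    for (k1, p1), (k2, p2) in combinations(sorted(phonetic.items()), 2):
        score = edit_distance(p1, p2) / max(len(p1), len(p2), 1)
        if score < threshold:
            pairs.append((k1, k2))
    return pairs


# Hypothetical phoneme sequences, for illustration only.
before = {
    "BRIAN": ["B", "R", "AY", "AH", "N"],
    "RYAN": ["R", "AY", "AH", "N"],
}
after = {
    "BRIAN SMITH": ["B", "R", "AY", "AH", "N", "S", "M", "IH", "TH"],
    "RYAN": ["R", "AY", "AH", "N"],
}
print(confusable_pairs(before))
print(confusable_pairs(after))
```

With these assumed sequences, BRIAN and RYAN differ by a single phoneme and are flagged, whereas lengthening the spoken form to BRIAN SMITH moves the pair above the threshold.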

It is a feature of embodiments of the present invention that the same set of phonetic format data is shared between TTS-guided-pronunciation editing and voice recognition. Namely, the user edits the pronunciation of a keyword guided by TTS-guided-pronunciation editing to be close to his/her own accent. Furthermore, the phonetic format data mapped from spoken form data that is the result of the TTS-guided-pronunciation editing process is used in the generation of voice models stored in voice model databases 112/212. Thus, the voice models correspond more closely to the specific pronunciation of the user and the recognition performance of VRS 202 can be improved.

FIG. 7A is a flow diagram of a preferred method of upgrading VRS 102.

In step 600, the system upgrade module 155 accesses a remote server via a network to determine if an updated version of the VRS 102 is available.

In step 610, if an updated version of the VRS 102 is available, the user is prompted regarding the availability of the upgrade.

If the user confirms the upgrade in step 610, in step 620 the updated version of VRS 102 is downloaded to the set-up device 100 via the network and stored in storage 180.

In step 640, the updated version of VRS 102 is installed on set-up device 100.

FIGS. 7B and 7C show flow diagrams of a preferred method of upgrading VRS 202.

In step 650, the system upgrade module 155 accesses a remote server via a network to determine if an updated version of the VRS 202 is available.

In step 660, if an updated version of the VRS 202 is available, the user is prompted regarding the availability of the upgrade.

If the user confirms the upgrade in step 660, in step 670 the updated version of VRS 202 is downloaded to the set-up device 100 via the network and stored in storage 180.

Then, with reference to FIG. 7C, in step 700, the VR device 200 is connected with the set-up device 100.

In step 710, system upgrade module 155 checks the version of VRS 202 installed on VR device 200.

If the updated version of VRS 202 is newer than the version installed on VR device 200, the user is prompted regarding the availability of an upgrade.

If the user confirms an upgrade, in step 730, the updated version of VRS 202 is downloaded to the VR device 200 and installed.
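The version comparison of step 710 and the following prompt might, as one possibility, be sketched as below. The dotted numeric version format and the names `parse_version` and `upgrade_to_offer` are assumptions for this sketch; the description does not specify how versions are represented.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse a dotted version string, ignoring trailing zero fields."""
    parts = [int(part) for part in text.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()  # so that "2.0" and "2.0.0" compare as equal
    return tuple(parts)


def upgrade_to_offer(installed: str, downloaded: Optional[str]) -> Optional[str]:
    """Return the version to offer to the user, or None if no upgrade applies.

    installed  -- version of VRS 202 found on the VR device (step 710)
    downloaded -- version held in storage on the set-up device, if any
    """
    if downloaded is None:
        return None
    if parse_version(downloaded) > parse_version(installed):
        return downloaded
    return None


print(upgrade_to_offer("1.9.3", "1.10.0"))
print(upgrade_to_offer("2.0", "2.0.0"))
```

Comparing parsed integer tuples rather than raw strings avoids treating "1.10.0" as older than "1.9.3".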

In preferred embodiments of the present invention, voice models are modified and downloaded to VR device 200 in two different ways: user-initiated and new-model-availability-initiated.

FIG. 8 is a flow diagram of a method of performing user-initiated adaptation of voice models on VR device 200 according to an embodiment of the present invention.

In step 801, the user profile is obtained by voice model update module 160.

In step 802, keyword models are built for adaptation for keywords in VKT 110. In preferred embodiments of the present invention, pre-defined keyword and digit models are built in advance, and only user-defined keyword models need to be built for adaptation in this step.

In step 803, the user is prompted to select a category for adaptation. The categories may include pre-defined keywords, digits, or user-defined keywords. As noted, pre-defined keywords are defined by the system, such as HOME corresponding to a text-string or SET-UP MENU corresponding to a command. User-defined keywords are those extracted during creation of the VKT 110 or entered by other means. Digits are the numerals 0 through 9.

In step 804, the user is prompted to select a mode. For example, the user may choose to adapt all keywords, adapt new keywords, or manually select the keywords to adapt.
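The category and mode selections of steps 803 and 804 could, for example, be combined as in the sketch below. The `KeywordEntry` record, the string labels for categories and modes, and the interpretation of "new keywords" as keywords not yet adapted are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

CATEGORIES = ("pre-defined", "digit", "user-defined")
MODES = ("all", "new", "manual")


@dataclass
class KeywordEntry:
    keyword: str
    category: str
    adapted: bool = False  # assumed flag: True once the keyword has been adapted


def select_for_adaptation(
    entries: Iterable[KeywordEntry],
    category: str,
    mode: str,
    manual_choice: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the keywords to adapt for the chosen category and mode."""
    if category not in CATEGORIES:
        raise ValueError("unknown category: " + category)
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    in_category = [e for e in entries if e.category == category]
    if mode == "all":
        chosen = in_category
    elif mode == "new":
        chosen = [e for e in in_category if not e.adapted]
    else:  # manual: keep only the keywords the user picked
        wanted = set(manual_choice or ())
        chosen = [e for e in in_category if e.keyword in wanted]
    return [e.keyword for e in chosen]


vkt_entries = [
    KeywordEntry("HOME", "pre-defined"),
    KeywordEntry("5", "digit"),
    KeywordEntry("RYAN", "user-defined", adapted=True),
    KeywordEntry("BRIAN", "user-defined"),
]
print(select_for_adaptation(vkt_entries, "user-defined", "new"))
```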

In step 805, an adaptation engine 161 in voice model update module 160 performs an adaptation using accumulated personal acoustic data corresponding to the user profile (if any), the currently existing voice models (for example, the original SI voice models or previously adapted voice models) stored in voice model database 112, and new speech input provided by the user to produce adapted voice models for download. In this step, the system is preferably trained with a number of utterances corresponding to keywords in the selected category as determined by the selected mode to improve the recognition performance of the system for a given user. Adaptation techniques are well known in the art and are not discussed in further detail here.

In a preferred embodiment, VR device 200 is connected to set-up device 100 and new speech input is captured via microphone 240. Otherwise, new speech input may be provided through a microphone of set-up device 100 (not shown).

It is a feature of preferred embodiments of the present invention that personal acoustic data is recorded and accumulated in storage 180 in association with the user profile during user-initiated adaptation. For example, if the user provides new speech input for the keyword RYAN, the recorded utterance is stored in storage 180 along with data associating the recorded utterance with the keyword RYAN.
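One possible in-memory representation of the accumulated personal acoustic data is sketched below. The class name `PersonalAcousticStore`, its methods, and the use of raw bytes for a recorded utterance are assumptions for illustration; the description states only that recorded utterances are stored in association with the user profile and the keyword.

```python
from collections import defaultdict
from typing import Dict, List


class PersonalAcousticStore:
    """Recorded utterances keyed by user profile and then by keyword."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, List[bytes]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def record(self, profile: str, keyword: str, utterance: bytes) -> None:
        """Accumulate one recorded utterance for the given profile and keyword."""
        self._data[profile][keyword].append(utterance)

    def utterances(self, profile: str, keyword: str) -> List[bytes]:
        """Return a copy of the utterances recorded for a keyword (may be empty)."""
        return list(self._data.get(profile, {}).get(keyword, []))

    def has_data(self, profile: str) -> bool:
        """True if any utterance has been accumulated for the profile."""
        return any(self._data.get(profile, {}).values())


store = PersonalAcousticStore()
store.record("user-1", "RYAN", b"\x00\x01")  # placeholder audio bytes
print(store.has_data("user-1"), store.has_data("user-2"))
```

A query such as `has_data` is the kind of check that a later update could use to decide whether adaptation can proceed without new speech input.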

In step 806, adapted voice models are downloaded from set-up device 100 to VR device 200 and stored in voice model database 212.

FIGS. 9A and 9B illustrate a method of modifying voice models on VR device 200 initiated by the availability of new voice models on a network according to an embodiment of the present invention.

First, as shown in FIG. 9A, new voice models are downloaded to the set-up device.

In step 810, a remote server is accessed via a network to determine if new voice models are available. New voice models may be, for example, new SI models developed reflecting improvements in the art or directed to a specific speaker group and stored on a remote server.

In step 811, if new voice models are available, the user is prompted regarding the availability of the update.

In step 812, if the user confirms the update, the new voice models are downloaded to the set-up device 100 via the network and saved in storage 180.

FIG. 9B is a flow diagram of a method of new-model-availability-initiated voice model adaptation according to an embodiment of the present invention.

In step 815, the user profile is obtained.

In step 816, the VR device 200 is connected to set-up device 100.

In step 817, voice model update module 160 compares the versions of the voice models in voice model database 212 on the VR device 200 with the new voice models stored in storage 180 on set-up device 100. If there are newer versions available on the set-up device, the user is prompted regarding the available upgrade.

If the user confirms the upgrade, in step 818, voice model update module 160 checks to determine if accumulated personal acoustic data corresponding to the user profile is available. For example, personal acoustic data accumulated during previous user-initiated adaptation may be stored in storage 180. Furthermore, personal acoustic data accumulated during normal operation of VR device 200 and stored in storage 280 may be uploaded to storage 180 and associated with the user profile.

If so, in step 820, VKT 210 is uploaded into a memory in set-up device 100.

In step 825, voice model update module 160 builds keyword models for adaptation. In preferred embodiments of the present invention, pre-defined keyword and digit models are built in advance. Thus, only user-defined keyword models need to be built for adaptation in this step.

In step 830, adaptation module 161 performs an adaptation using the built keyword models, new voice models and the accumulated personal acoustic data to generate adapted new voice models. In this step, the accumulated personal acoustic data is used as speech input by the adaptation module 161. This allows for adaptation of the new models to occur without the need for new speech input by the user.

In step 835, adapted new voice models are downloaded to VR device 200.

If, on the other hand, no accumulated personal acoustic data exists, in step 840, VKT 210 is uploaded into a memory in set-up device 100.

In step 845, voice model update module 160 builds keyword models using the new voice models. In preferred embodiments of the present invention, pre-defined keyword and digit models are built in advance. Thus, only user-defined keyword models need to be built for adaptation in this step.

In step 850, updated new voice models are downloaded to VR device 200.
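The branching of FIG. 9B from step 817 onward can be summarized in a short sketch. The function name `plan_model_update`, the integer version numbers, and the boolean inputs are assumptions for illustration; the step numbers returned are those used in the description above.

```python
from typing import List


def plan_model_update(
    device_version: int,
    new_version: int,
    user_confirms: bool,
    has_personal_data: bool,
) -> List[int]:
    """Return the steps that would follow the version comparison of step 817."""
    if new_version <= device_version or not user_confirms:
        return []  # nothing newer on the set-up device, or the user declined
    if has_personal_data:
        # Step 818 finds accumulated data: upload VKT, build keyword models,
        # adapt using the accumulated data, download adapted new models.
        return [818, 820, 825, 830, 835]
    # Step 818 finds no data: upload VKT, build keyword models from the new
    # voice models, download updated new models.
    return [818, 840, 845, 850]


print(plan_model_update(1, 2, True, True))
print(plan_model_update(1, 2, True, False))
```

The only difference between the two branches in this sketch is whether an adaptation step using accumulated personal acoustic data is included before the download.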

FIG. 10 shows an exemplary flow diagram of a method of performing a diagnostic routine according to an embodiment of the present invention.

In step 900, the VR device 200 is connected to set-up device 100.

In step 910, diagnostics module 165 checks the connection between the VR device 200 and the set-up device 100.

In step 920, diagnostics module 165 checks the flash content of memory in which VR system 202 is stored.

In step 930, diagnostics module 165 checks the battery status of battery 250.

In step 940, diagnostics module 165 checks the functioning of speaker 230. In a preferred embodiment of the invention, a test prompt is transmitted to the VR device 200 and output through speaker 230. If the user hears the test prompt, the user inputs a positive acknowledgement through input 130 of set-up device 100. Otherwise, the user inputs a negative acknowledgement through input 130 and the test fails.

In step 950, diagnostics module 165 checks the functioning of microphone 240. In a preferred embodiment of the invention, the user is prompted to speak into microphone 240. Based on the user's speech input, microphone volume is optimized such that the audio input is neither saturated nor too small to be detected. In this regard, an echo test may be performed by controller 201 to obtain the optimized input volume of microphone 240 and output volume of speaker 230. If no input is detected, the test fails.

In preferred embodiments of the invention, the user is notified on display 120 of any failed test. Furthermore, where appropriate, suggested fixes are provided to the user.
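A diagnostic routine of the kind walked through in steps 910 to 950 might be organized as in the sketch below. The `run_diagnostics` function, the check names, the stand-in test callables, and the suggested-fix strings are all hypothetical; the sketch also assumes that every check is run even after an earlier one fails, which the description does not specify.

```python
from typing import Callable, Dict, List, Tuple

# (check name, test callable returning True on pass, suggested fix on failure)
Check = Tuple[str, Callable[[], bool], str]


def run_diagnostics(checks: List[Check]) -> Dict[str, str]:
    """Run every check in order and return {failed check name: suggested fix}."""
    failures: Dict[str, str] = {}
    for name, test, fix in checks:
        try:
            passed = test()
        except Exception:
            passed = False  # a check that cannot run is reported as failed
        if not passed:
            failures[name] = fix
    return failures


# Stand-in checks; a real routine would query the connected VR device.
example_checks: List[Check] = [
    ("connection", lambda: True, "Reconnect the VR device to the set-up device."),
    ("flash content", lambda: True, "Reinstall the voice recognition system."),
    ("battery", lambda: False, "Recharge or replace the battery."),
    ("speaker", lambda: True, "Check the speaker volume."),
    ("microphone", lambda: True, "Speak closer to the microphone."),
]

for failed, fix in run_diagnostics(example_checks).items():
    print(failed + ": " + fix)
```

Returning the failures together with a suggested fix keeps the notification shown to the user separate from the checks themselves.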

While the invention has been described by way of example and in terms of the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

1. A method for improved voice recognition in a system having a set-up device and a voice recognition device, comprising the steps of: generating a Voice Keyword Table (VKT) and downloading the VKT to the voice recognition device; upgrading a voice recognition system on the voice recognition device; and modifying a voice model in the voice recognition device, whereby the voice recognition is improved.
2-11. (canceled)
12. The method of claim 1, wherein the step of upgrading and downloading a voice recognition system to the voice recognition device comprises the steps of: downloading an updated voice recognition system to the set-up device via a network; determining if the updated voice recognition system is more recent than a voice recognition system on the voice recognition device; and if the updated voice recognition system is more recent, downloading the updated voice recognition system from the set-up device to the voice recognition device.
13-16. (canceled)
17. The method of claim 1, further comprising a step of providing customer support over a network.
18. The method of claim 1, further comprising a step of providing wireless capable device compatibility support comprising instructions for pairing the voice recognition device with a wireless capable application device.
19. A voice recognition system installed on a set-up device for improving voice recognition on a voice recognition device comprising: a Voice Keyword Table (VKT) generating means for generating a VKT and downloading the VKT to the voice recognition device; and means for updating voice models on the voice recognition device.
20-22. (canceled)
23. The voice recognition system of claim 19, further comprising means for user-initiated adaptation of voice models on the voice recognition device.
24. The voice recognition system of claim 19, further comprising means for new-model-availability-initiated adaptation of voice models on the voice recognition device.
25. The voice recognition system of claim 24, wherein the means for new-model-availability-initiated adaptation uses accumulated personal acoustic data recorded during user-initiated adaptation of voice models on the voice recognition device.
26. The voice recognition system of claim 24, wherein the means for new-model-availability-initiated adaptation uses accumulated personal acoustic data recorded during operation of the voice recognition device to identify keywords.
27. The voice recognition system of claim 19, further including means for upgrading and downloading a voice recognition system to the voice recognition device.
28. (canceled)
29. The voice recognition system of claim 19, further including means for providing customer support via a network.
30. The voice recognition system of claim 19, further including means for providing wireless capable device compatibility support comprising instructions for pairing the voice recognition device with a wireless capable application device.
31. An apparatus for improved voice recognition, comprising: a set-up device comprising a first Voice Keyword Table (VKT) and a first voice recognition system; and a voice recognition device comprising a second VKT corresponding to the first VKT and a second voice recognition system, the voice recognition device connectible to the set-up device through an interface.
32. (canceled)
33. The apparatus of claim 31, wherein the voice recognition device is a Voice Key Pad (VKP) device.
34. The apparatus of claim 31, wherein the voice recognition device is a wireless earset.
35. The apparatus of claim 31, wherein the set-up device is a personal computer (PC).